ABSTRACT

Summary: Database search is the most widely used approach for
peptide and protein identification in mass spectrometry-based
proteomics studies. Our previous study showed that sample-specific
protein databases derived from RNA-Seq data can better approximate
the real protein pools in the samples and thus improve protein
identification. More importantly, single nucleotide variations, short
insertions and deletions and novel junctions identified from RNA-Seq
data make protein databases more complete and sample-specific. Here,
we report an R package, customProDB, that enables the easy generation
of customized databases from RNA-Seq data for proteomics search. This
work bridges genomics and proteomics studies and facilitates
cross-omics data integration.

Availability and implementation: customProDB and related documents
are freely available at
http://bioconductor.org/packages/2.13/bioc/html/customProDB.html.
Contact: bing.zhang@vanderbilt.edu 

Supplementary information: Supplementary data are available at 
Bioinformatics online. 

Received on June 6, 2013; revised on September 4, 2013; accepted 
on September 16, 2013 

1 INTRODUCTION 

Sequence database search is the primary method for peptide and
protein identification in mass spectrometry-based shotgun
proteomics (Nesvizhskii, 2010). The completeness and specificity of
the sequence databases directly affect the search results. We
recently showed that sample-specific databases derived from
RNA-Seq data can better represent the real protein catalogs
in biological samples and thus improve protein identification.
In addition, sample-specific databases allow the identification
of variant peptides (Wang et al., 2012).

With the advancement of both shotgun proteomics and
next-generation sequencing (NGS) technologies, many researchers
have started to apply both technologies to the same samples in
parallel to gain a multi-dimensional understanding of cellular
systems (Chen et al., 2012; Nagaraj et al., 2011). Even for
proteomics studies without corresponding RNA-Seq data, sequencing
data (e.g. whole-genome sequencing, exome sequencing) are highly
likely to be available for similar samples. Here, we report an
R package, customProDB, which is dedicated to generating
customized databases from NGS data, with a focus on RNA-Seq
data, for proteomics search.

Based on the assumption that undetected or lowly expressed 
transcripts are less likely to produce detectable proteins, the 
package allows users to filter out proteins with undetected or 
lowly expressed transcripts. It also allows users to incorporate
single nucleotide variations, short insertions and deletions and
novel junctions identified from RNA-Seq data into the protein
database.

Figure 1 illustrates the overall structure of the package. Methods 
and functions implemented in the customProDB package are 
described in detail in a tutorial available as online Supplementary 
Material (Supplementary File 1). In section 2, we briefly present the 
main functionalities of the customProDB package. 

2 DESCRIPTION 

2.1 Preparing annotation files 

For model organisms, customProDB allows users to download
annotation data from the University of California, Santa Cruz
(UCSC) table browser using rtracklayer (Lawrence et al., 2009)
or from ENSEMBL using biomaRt (Durinck et al., 2009) and
then process them to generate a standardized data structure. For
non-model organisms, users can manually provide the annotation
data in the format of UCSC or ENSEMBL.

2.2 Building customized protein databases 

2.2.1 Input data customProDB requires a Binary Alignment/Map
(BAM) file and a Variant Call Format (VCF) file as input for each
sample of interest. The latter can be generated from a BAM file
using single nucleotide polymorphism calling tools such as
SAMtools and the Genome Analysis Toolkit (GATK). customProDB
also accepts transcript expression estimates when available.

For junction analysis, a Browser Extensible Data (BED) file
that contains putative splice junctions is needed. This file can be
generated by software such as TopHat (Trapnell et al., 2009) during
read alignment.

2.2.2 Expression filter For a given BAM file, the 
calculateRPKM function computes the reads per kilobase per 
million reads sequenced (RPKM) for each transcript based on 
reads mapped to the exon region. Then the Outputproseq function
outputs a FASTA file for proteins with an RPKM value
greater than a user-defined cutoff.
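
The expression filter can be illustrated with a minimal Python sketch.
It is not the customProDB implementation and does not call its API: the
function names rpkm, filter_by_rpkm and to_fasta, the toy read counts
and the toy protein sequences are all assumptions made for illustration.
The sketch uses the common definition RPKM = reads x 10^9 / (exon
length in bp x total mapped reads).

```python
def rpkm(read_count, exon_length_bp, total_mapped_reads):
    """Reads per kilobase of exon per million mapped reads."""
    return read_count * 1e9 / (exon_length_bp * total_mapped_reads)


def filter_by_rpkm(transcripts, total_mapped_reads, cutoff):
    """Return {transcript_id: RPKM} for transcripts whose RPKM exceeds cutoff.

    transcripts maps a transcript id to (reads on exons, exon length in bp).
    """
    kept = {}
    for tx_id, (reads, exon_len) in transcripts.items():
        value = rpkm(reads, exon_len, total_mapped_reads)
        if value > cutoff:
            kept[tx_id] = value
    return kept


def to_fasta(protein_seqs, kept_ids):
    """Format the retained proteins as FASTA text."""
    return "".join(f">{tx}\n{protein_seqs[tx]}\n" for tx in kept_ids)


# Hypothetical toy data, not taken from the paper.
transcripts = {"tx_A": (500, 2000), "tx_B": (4, 1000)}
protein_seqs = {"tx_A": "MAEK", "tx_B": "MLLP"}
kept = filter_by_rpkm(transcripts, total_mapped_reads=20_000_000, cutoff=1.0)
fasta_text = to_fasta(protein_seqs, kept)
print(fasta_text)
```

In this toy case only tx_A passes the cutoff, so only its protein
sequence is written to the FASTA text.
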

Fig. 1. Schematic overview of the customProDB package

Fig. 2. Consensus protein database generation and proteomics search
results for three independent colon cancer patients (400T, 782T, 823T)

For generating a consensus database from n (n > 1) related
samples, a function OutputsharedPro is provided to output protein
sequences for transcripts with an RPKM value greater than
a user-defined cutoff in k (1 ≤ k ≤ n) out of the n samples.
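
The k-out-of-n rule can be sketched in a few lines of Python. This is
an illustration only: the function name shared_transcripts and the toy
RPKM values are hypothetical, and the sketch assumes that "in k out of
the n samples" means "in at least k samples".

```python
def shared_transcripts(rpkm_by_sample, cutoff, k):
    """Return transcripts whose RPKM exceeds cutoff in at least k samples.

    rpkm_by_sample maps a sample id to {transcript_id: RPKM}.
    """
    n = len(rpkm_by_sample)
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    counts = {}
    for sample_rpkm in rpkm_by_sample.values():
        for tx_id, value in sample_rpkm.items():
            if value > cutoff:
                counts[tx_id] = counts.get(tx_id, 0) + 1
    return sorted(tx_id for tx_id, c in counts.items() if c >= k)


# Hypothetical toy data for n = 3 samples.
rpkm_by_sample = {
    "s1": {"tx_A": 12.5, "tx_B": 0.2, "tx_C": 3.0},
    "s2": {"tx_A": 8.0, "tx_B": 1.5, "tx_C": 0.4},
    "s3": {"tx_A": 5.1, "tx_B": 0.0, "tx_C": 2.2},
}
print(shared_transcripts(rpkm_by_sample, cutoff=1.0, k=2))
```

Raising k makes the consensus database stricter: with k = n only
transcripts expressed above the cutoff in every sample are retained.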

2.2.3 Variation annotation The InputVcf function generates a
GRanges (Lawrence et al., 2013) object from a VCF file, which
contains variation information from one or multiple samples.
The Multiple_VCF function outputs a GRanges object with
variations present in multiple samples.

For a given GRanges object, the Varlocation function provides
an overview of the genomic locations for all variations. Then
protein-level variations are identified for both single nucleotide
variations and short insertions and deletions.
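
How a single nucleotide variation in a coding sequence maps to a
protein-level change can be sketched as follows. This Python example
is an assumption-laden illustration, not the package's code: the names
translate and annotate_snv, the 0-based coding-sequence coordinate and
the toy sequence are hypothetical, and short insertions and deletions
are not handled.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def translate(cds):
    """Translate a coding sequence, ignoring any trailing partial codon."""
    usable = len(cds) - len(cds) % 3
    return "".join(CODON_TABLE[cds[i:i + 3]] for i in range(0, usable, 3))


def annotate_snv(cds, pos, alt):
    """Return e.g. 'E2V' for an amino acid change, or None if synonymous.

    pos is a 0-based position within the coding sequence (an assumption
    of this sketch; VCF files use 1-based genomic coordinates).
    """
    start = (pos // 3) * 3
    ref_codon = cds[start:start + 3]
    offset = pos % 3
    alt_codon = ref_codon[:offset] + alt + ref_codon[offset + 1:]
    ref_aa, alt_aa = CODON_TABLE[ref_codon], CODON_TABLE[alt_codon]
    if ref_aa == alt_aa:
        return None
    return f"{ref_aa}{pos // 3 + 1}{alt_aa}"


cds = "ATGGAGGAG"  # hypothetical coding sequence
print(translate(cds))
print(annotate_snv(cds, 4, "T"))  # GAG -> GTG changes the amino acid
print(annotate_snv(cds, 5, "A"))  # GAG -> GAA is synonymous
```

Only variations that change the amino acid sequence would need to be
written to a variant protein database; synonymous changes leave the
protein unchanged.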

VCF files derived from whole-genome or exome sequencing 
data can also be used to generate customized databases. 

2.2.4 Junction analysis Based on an input BED file that contains
splice junctions derived from RNA-Seq data, the function
JunctionType classifies all junctions into different categories.
Then the function OutputNovelJun can be used to generate 
three-frame translated peptide sequences for all putative novel 
junctions. 
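
Three-frame translation of a junction-spanning sequence can be
sketched in Python as below. The sketch is illustrative and makes
several assumptions: the function name three_frame_translate and the
toy flanking sequences are hypothetical, the junction sequence is
taken to be the end of the upstream exon joined to the start of the
downstream exon on the forward strand, and stop codons are reported
as '*' rather than used to trim peptides.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def translate(seq):
    """Translate a nucleotide sequence, ignoring a trailing partial codon."""
    usable = len(seq) - len(seq) % 3
    return "".join(CODON_TABLE[seq[i:i + 3]] for i in range(0, usable, 3))


def three_frame_translate(seq):
    """Translate seq in the three forward reading frames (offsets 0, 1, 2)."""
    return [translate(seq[frame:]) for frame in range(3)]


# Hypothetical flanking sequences around a putative novel junction.
upstream_exon_end = "ATGGCC"
downstream_exon_start = "AAAGGA"
junction_seq = upstream_exon_end + downstream_exon_start
for frame, peptide in enumerate(three_frame_translate(junction_seq)):
    print(frame, peptide)
```

All three frames are kept because the reading frame of a novel
junction is not known in advance.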

3 APPLICATION 

The development of the customProDB package was mainly
driven by two demands: (i) to provide a customized protein
database from RNA-Seq data for a specific sample, and (ii) to
provide a consensus database from a pool of genetically similar
samples. Therefore, we provide two integrated functions to
help accomplish these tasks in a single step (Fig. 1).



The value of customized databases for individual samples has
already been demonstrated (Wang et al., 2012). Here, we provide
an example of a consensus database. A consensus database was
generated based on RNA-Seq data from 64 colon cancer samples
from The Cancer Genome Atlas project (TCGA, 2012).
Previously published proteomics data from three colon cancer
patients (Li et al., 2011) were searched against the consensus
database (Fig. 2). By including variation and novel junction
information in the consensus database, we were able to identify
variant peptides and novel junction peptides from the proteomics
datasets (Supplementary Files 2 and 3). We did not gain significant
improvements in protein identification by applying the
transcript expression threshold, possibly because of the high
inter-patient heterogeneity. However, compared with the regular
RefSeq database search, more peptide-spectrum matches were
identified using the consensus database. This example shows the
potential of using a consensus database to capture protein
features shared by a cohort of samples.

4 CONCLUSION 

The huge amount of genomic and transcriptomic data available 
from NGS experiments has enhanced and will continue to 
enhance shotgun proteomics studies. However, it is non-trivial
for ordinary proteomics researchers to use such data directly. 
The customProDB package fills this gap by providing an efficient 
tool to generate customized protein databases using expression 
and variation information available from NGS data. 